
Conversation


@zufayu zufayu commented Jan 28, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist


Copilot AI left a comment


Pull request overview

Adds an FP32-output (likely FP32-accumulation) path for the fused MoE g1u1 per-token int8 SiLU kernel on gfx942, wiring a new hsaco + CSV config into the existing heuristic kernel-selection pipeline and exposing it through the Python/C++ APIs.

Changes:

  • Add gfx942 FP32 per-token int8 g1u1 SiLU hsaco (.co) and its kernel list CSV.
  • Extend fmoe_g1u1_a16 (C++/HIP) to select an FP32 config map and launch with T_O=float when out is FP32.
  • Update Python asm_moe to optionally allocate moe_buf as FP32 and cast back to the original dtype after execution (roughly the pattern sketched after this list).
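
In rough terms, the allocate-then-cast pattern from the last bullet looks like the sketch below. This is a minimal, self-contained illustration only; the tensor shape, dtype, and flag values are made-up stand-ins, not the real asm_moe arguments.

    import torch

    # Hypothetical stand-ins for the real asm_moe inputs.
    token_num, model_dim = 32, 4096
    dtype = torch.bfloat16        # dtype the caller expects back
    enable_fp32 = True            # heuristic decided by the gating shown further below

    # Allocate the MoE output buffer in FP32 when the FP32 kernel path is taken.
    moebuf_dtype = torch.float32 if enable_fp32 else dtype
    moe_buf = torch.zeros(token_num, model_dim, dtype=moebuf_dtype)

    # ... the fused MoE kernel would write its results into moe_buf here ...

    # Cast back to the caller's original dtype after execution.
    if moe_buf.dtype != dtype:
        moe_buf = moe_buf.to(dtype)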

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

  • hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_vs_smf_silu_1tg_32x384.co: adds the gfx942 FP32 SiLU g1u1 per-token int8 assembled kernel blob.
  • hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_silu.csv: registers the new FP32 kernel in the asm config generation pipeline.
  • csrc/py_itfs_cu/asm_fmoe.cu: enables selecting and launching the FP32-output kernel variant for fmoe_g1u1_a16.
  • aiter/fused_moe_bf16_asm.py: adds a Python-side heuristic to allocate an FP32 output buffer and convert it back to the original dtype.


Comment on lines 80 to 90
    is_g1u1 = (
        w2.shape[2] * 2 * lastdim_mul == w1.shape[1] and fc2_smooth_scale is not None
    )
    enable_fp32 = (
        fc2_smooth_scale is not None
        and is_g1u1
        and (inter_dim % 384 == 0)
        and w1.dtype == dtypes.i8
        and a16
    )
    moebuf_dtype = torch.float32 if enable_fp32 else dtype

Copilot AI Jan 28, 2026


enable_fp32 can currently become True even when activation is GELU or when running on non-gfx942 GPUs. In those cases moe_buf is allocated as FP32 and routed into aiter.fmoe_g1u1_a16, but the C++ side only wires up FP32 output support for (int8 weights, Silu) and only has an FP32 hsaco listed for gfx942; this will reliably throw at runtime (unsupported activation/dtype or “No suitable kernel found”). Gate enable_fp32 at least on activation == ActivationType.Silu and the current arch (e.g. get_gfx() == "gfx942"), or add a safe fallback when the FP32 kernel isn’t available.
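
A sketch of the suggested gate, reusing the variable names from the diff hunk above (get_gfx() and ActivationType follow the review's wording; where exactly they are imported from on the Python side is an assumption):

    # Sketch only: variable names come from the diff hunk above; the import locations
    # of ActivationType and get_gfx() are assumptions.
    enable_fp32 = (
        fc2_smooth_scale is not None
        and is_g1u1
        and (inter_dim % 384 == 0)
        and w1.dtype == dtypes.i8
        and a16
        and activation == ActivationType.Silu  # FP32 hsaco only covers SiLU
        and get_gfx() == "gfx942"              # FP32 hsaco is only shipped for gfx942
    )
    moebuf_dtype = torch.float32 if enable_fp32 else dtype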

Comment on lines +774 to 776
    else if(out.dtype() == at::ScalarType::Float && activation == ActivationType::Silu)
        config_map = &cfg_fmoe_fp32_pertokenInt8_g1u1_silu;
    else

Copilot AI Jan 28, 2026


FP32 output is now selectable here, but the only shipped FP32 hsaco in this PR is for gfx942 and subGU_n=384 (see hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_silu.csv). If callers pass out as FP32 on other arches or with an inter_dim not divisible by 384, this will fail later with a generic “No suitable kernel found”. Consider adding an explicit early TORCH_CHECK documenting the FP32 constraints (arch + supported inter_dim tiles) so the failure mode is clearer.
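
A minimal sketch of such a check, placed before the config-map selection; the arch string and inter_dim variables are placeholders for however asm_fmoe.cu actually exposes them, not the real names:

    // Sketch only: `arch` and `inter_dim` stand in for however the current GPU arch
    // and intermediate dimension are obtained in asm_fmoe.cu.
    if(out.dtype() == at::ScalarType::Float)
    {
        TORCH_CHECK(activation == ActivationType::Silu,
                    "fmoe_g1u1_a16: FP32 output is only supported with SiLU activation");
        TORCH_CHECK(arch == "gfx942",
                    "fmoe_g1u1_a16: FP32 output kernels are only shipped for gfx942");
        TORCH_CHECK(inter_dim % 384 == 0,
                    "fmoe_g1u1_a16: FP32 output requires inter_dim to be a multiple of 384");
    }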

